First, removing any missing data is not part of a data cleaning process. Indeed, no one likes missing data, but it is dangerous to assume that it can simply be removed or replaced. Sometimes missing data tells us something important about whatever it is that we’re measuring (i.e. the value of the variable that is missing may be related to - the reason it is missing). Such data is called Missing not at Random, or MNAR.

Here’s what messy data look like

In the final chapter of this course, you will be presented with a messy, real-world dataset containing an entire year’s worth of weather data from Boston, USA. Among other things, you’ll be presented with variables that contain column names, column names that should be values, numbers coded as character strings, and values that are missing, extreme, and downright erroneous!

# Read weather RDS data
weather <- readRDS("../xDatasets/weather.rds")

# View the first 6 rows of data
weather %>%
  head() %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", , font_size = 11) %>%
  row_spec(0, bold = T, color = "white", background = "#3f7689")
X year month measure X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15 X16 X17 X18 X19 X20 X21 X22 X23 X24 X25 X26 X27 X28 X29 X30 X31
1 2014 12 Max.TemperatureF 64 42 51 43 42 45 38 29 49 48 39 39 42 45 42 44 49 44 37 36 36 44 47 46 59 50 52 52 41 30 30
2 2014 12 Mean.TemperatureF 52 38 44 37 34 42 30 24 39 43 36 35 37 39 37 40 45 40 33 32 33 39 45 44 52 44 45 46 36 26 25
3 2014 12 Min.TemperatureF 39 33 37 30 26 38 21 18 29 38 32 31 32 33 32 35 41 36 29 27 30 33 42 41 44 37 38 40 30 22 20
4 2014 12 Max.Dew.PointF 46 40 49 24 37 45 36 28 49 45 37 28 28 29 33 42 46 34 25 30 30 39 45 46 58 31 34 42 26 10 8
5 2014 12 MeanDew.PointF 40 27 42 21 25 40 20 16 41 39 31 27 26 27 29 36 41 30 22 24 27 34 42 44 43 29 31 35 20 4 5
6 2014 12 Min.DewpointF 26 17 24 13 12 36 -3 3 28 37 27 25 24 25 27 30 32 26 20 20 25 25 37 41 29 28 29 27 10 -6 1
# View the last 6 rows of data
weather %>%
  tail() %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", , font_size = 11) %>%
  row_spec(0, bold = T, color = "white", background = "#3f7689")
X year month measure X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15 X16 X17 X18 X19 X20 X21 X22 X23 X24 X25 X26 X27 X28 X29 X30 X31
281 281 2015 12 Mean.Wind.SpeedMPH 6 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
282 282 2015 12 Max.Gust.SpeedMPH 17 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
283 283 2015 12 PrecipitationIn 0.14 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
284 284 2015 12 CloudCover 7 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
285 285 2015 12 Events Rain NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
286 286 2015 12 WindDirDegrees 109 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
# View a condensed summary of the data
str(weather)
## 'data.frame':    286 obs. of  35 variables:
##  $ X      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ year   : int  2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 ...
##  $ month  : int  12 12 12 12 12 12 12 12 12 12 ...
##  $ measure: chr  "Max.TemperatureF" "Mean.TemperatureF" "Min.TemperatureF" "Max.Dew.PointF" ...
##  $ X1     : chr  "64" "52" "39" "46" ...
##  $ X2     : chr  "42" "38" "33" "40" ...
##  $ X3     : chr  "51" "44" "37" "49" ...
##  $ X4     : chr  "43" "37" "30" "24" ...
##  $ X5     : chr  "42" "34" "26" "37" ...
##  $ X6     : chr  "45" "42" "38" "45" ...
##  $ X7     : chr  "38" "30" "21" "36" ...
##  $ X8     : chr  "29" "24" "18" "28" ...
##  $ X9     : chr  "49" "39" "29" "49" ...
##  $ X10    : chr  "48" "43" "38" "45" ...
##  $ X11    : chr  "39" "36" "32" "37" ...
##  $ X12    : chr  "39" "35" "31" "28" ...
##  $ X13    : chr  "42" "37" "32" "28" ...
##  $ X14    : chr  "45" "39" "33" "29" ...
##  $ X15    : chr  "42" "37" "32" "33" ...
##  $ X16    : chr  "44" "40" "35" "42" ...
##  $ X17    : chr  "49" "45" "41" "46" ...
##  $ X18    : chr  "44" "40" "36" "34" ...
##  $ X19    : chr  "37" "33" "29" "25" ...
##  $ X20    : chr  "36" "32" "27" "30" ...
##  $ X21    : chr  "36" "33" "30" "30" ...
##  $ X22    : chr  "44" "39" "33" "39" ...
##  $ X23    : chr  "47" "45" "42" "45" ...
##  $ X24    : chr  "46" "44" "41" "46" ...
##  $ X25    : chr  "59" "52" "44" "58" ...
##  $ X26    : chr  "50" "44" "37" "31" ...
##  $ X27    : chr  "52" "45" "38" "34" ...
##  $ X28    : chr  "52" "46" "40" "42" ...
##  $ X29    : chr  "41" "36" "30" "26" ...
##  $ X30    : chr  "30" "26" "22" "10" ...
##  $ X31    : chr  "30" "25" "20" "8" ...

Exploring raw data

Getting a feel for your data

The first thing to do when you get your hands on a new dataset is to understand its structure. There are several ways to go about this in R, each of which may reveal different issues with your data that require attention.

In this course, we are only concerned with data that can be expressed in table format (i.e. two dimensions, rows and columns). As you may recall from earlier courses, tables in R often have the type data.frame. You can check the class of any object in R with the class() function.

Once you know that you are dealing with tabular data, you may also want to get a quick feel for the contents of your data. Before printing the entire dataset to the console, it’s probably worth knowing how many rows and columns there are. The dim() command tells you this.

# Read BMI data
library(readr)
bmi <- read_csv("../xDatasets/bmi_clean.csv")

# Check the class of bmi
class(bmi)
## [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"
# Check the dimensions of bmi
dim(bmi)
## [1] 199  30
# View the column names of bmi
names(bmi)
##  [1] "Country" "Y1980"   "Y1981"   "Y1982"   "Y1983"   "Y1984"   "Y1985"  
##  [8] "Y1986"   "Y1987"   "Y1988"   "Y1989"   "Y1990"   "Y1991"   "Y1992"  
## [15] "Y1993"   "Y1994"   "Y1995"   "Y1996"   "Y1997"   "Y1998"   "Y1999"  
## [22] "Y2000"   "Y2001"   "Y2002"   "Y2003"   "Y2004"   "Y2005"   "Y2006"  
## [29] "Y2007"   "Y2008"

Viewing the structure of your data

Since bmi doesn’t have a huge number of columns, you can view a quick snapshot of your data using the str() (for structure) command. In addition to the class and dimensions of your entire dataset, str() will tell you the class of each variable and give you a preview of its contents.

Although we won’t go into detail on the dplyr package in this lesson (see the Data Manipulation in R with dplyr course), the glimpse() function from dplyr is a slightly cleaner alternative to str(). str() and glimpse() give you a preview of your data, which may reveal issues with the way columns are labelled, how variables are encoded, etc.

# Check the structure of bmi
str(bmi)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 199 obs. of  30 variables:
##  $ Country: chr  "Afghanistan" "Albania" "Algeria" "Andorra" ...
##  $ Y1980  : num  21.5 25.2 22.3 25.7 20.9 ...
##  $ Y1981  : num  21.5 25.2 22.3 25.7 20.9 ...
##  $ Y1982  : num  21.5 25.3 22.4 25.7 20.9 ...
##  $ Y1983  : num  21.4 25.3 22.5 25.8 20.9 ...
##  $ Y1984  : num  21.4 25.3 22.6 25.8 20.9 ...
##  $ Y1985  : num  21.4 25.3 22.7 25.9 20.9 ...
##  $ Y1986  : num  21.4 25.3 22.8 25.9 21 ...
##  $ Y1987  : num  21.4 25.3 22.8 25.9 21 ...
##  $ Y1988  : num  21.3 25.3 22.9 26 21 ...
##  $ Y1989  : num  21.3 25.3 23 26 21.1 ...
##  $ Y1990  : num  21.2 25.3 23 26.1 21.1 ...
##  $ Y1991  : num  21.2 25.3 23.1 26.2 21.1 ...
##  $ Y1992  : num  21.1 25.2 23.2 26.2 21.1 ...
##  $ Y1993  : num  21.1 25.2 23.3 26.3 21.1 ...
##  $ Y1994  : num  21 25.2 23.3 26.4 21.1 ...
##  $ Y1995  : num  20.9 25.3 23.4 26.4 21.2 ...
##  $ Y1996  : num  20.9 25.3 23.5 26.5 21.2 ...
##  $ Y1997  : num  20.8 25.3 23.5 26.6 21.2 ...
##  $ Y1998  : num  20.8 25.4 23.6 26.7 21.3 ...
##  $ Y1999  : num  20.8 25.5 23.7 26.8 21.3 ...
##  $ Y2000  : num  20.7 25.6 23.8 26.8 21.4 ...
##  $ Y2001  : num  20.6 25.7 23.9 26.9 21.4 ...
##  $ Y2002  : num  20.6 25.8 24 27 21.5 ...
##  $ Y2003  : num  20.6 25.9 24.1 27.1 21.6 ...
##  $ Y2004  : num  20.6 26 24.2 27.2 21.7 ...
##  $ Y2005  : num  20.6 26.1 24.3 27.3 21.8 ...
##  $ Y2006  : num  20.6 26.2 24.4 27.4 21.9 ...
##  $ Y2007  : num  20.6 26.3 24.5 27.5 22.1 ...
##  $ Y2008  : num  20.6 26.4 24.6 27.6 22.3 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Country = col_character(),
##   ..   Y1980 = col_double(),
##   ..   Y1981 = col_double(),
##   ..   Y1982 = col_double(),
##   ..   Y1983 = col_double(),
##   ..   Y1984 = col_double(),
##   ..   Y1985 = col_double(),
##   ..   Y1986 = col_double(),
##   ..   Y1987 = col_double(),
##   ..   Y1988 = col_double(),
##   ..   Y1989 = col_double(),
##   ..   Y1990 = col_double(),
##   ..   Y1991 = col_double(),
##   ..   Y1992 = col_double(),
##   ..   Y1993 = col_double(),
##   ..   Y1994 = col_double(),
##   ..   Y1995 = col_double(),
##   ..   Y1996 = col_double(),
##   ..   Y1997 = col_double(),
##   ..   Y1998 = col_double(),
##   ..   Y1999 = col_double(),
##   ..   Y2000 = col_double(),
##   ..   Y2001 = col_double(),
##   ..   Y2002 = col_double(),
##   ..   Y2003 = col_double(),
##   ..   Y2004 = col_double(),
##   ..   Y2005 = col_double(),
##   ..   Y2006 = col_double(),
##   ..   Y2007 = col_double(),
##   ..   Y2008 = col_double()
##   .. )
# Load dplyr
library(dplyr)

# Check the structure of bmi, the dplyr way
glimpse(bmi)
## Observations: 199
## Variables: 30
## $ Country <chr> "Afghanistan", "Albania", "Algeria", "Andorra", "Angol...
## $ Y1980   <dbl> 21.48678, 25.22533, 22.25703, 25.66652, 20.94876, 23.3...
## $ Y1981   <dbl> 21.46552, 25.23981, 22.34745, 25.70868, 20.94371, 23.3...
## $ Y1982   <dbl> 21.45145, 25.25636, 22.43647, 25.74681, 20.93754, 23.4...
## $ Y1983   <dbl> 21.43822, 25.27176, 22.52105, 25.78250, 20.93187, 23.5...
## $ Y1984   <dbl> 21.42734, 25.27901, 22.60633, 25.81874, 20.93569, 23.6...
## $ Y1985   <dbl> 21.41222, 25.28669, 22.69501, 25.85236, 20.94857, 23.7...
## $ Y1986   <dbl> 21.40132, 25.29451, 22.76979, 25.89089, 20.96030, 23.8...
## $ Y1987   <dbl> 21.37679, 25.30217, 22.84096, 25.93414, 20.98025, 23.9...
## $ Y1988   <dbl> 21.34018, 25.30450, 22.90644, 25.98477, 21.01375, 24.0...
## $ Y1989   <dbl> 21.29845, 25.31944, 22.97931, 26.04450, 21.05269, 24.1...
## $ Y1990   <dbl> 21.24818, 25.32357, 23.04600, 26.10936, 21.09007, 24.2...
## $ Y1991   <dbl> 21.20269, 25.28452, 23.11333, 26.17912, 21.12136, 24.3...
## $ Y1992   <dbl> 21.14238, 25.23077, 23.18776, 26.24017, 21.14987, 24.4...
## $ Y1993   <dbl> 21.06376, 25.21192, 23.25764, 26.30356, 21.13938, 24.5...
## $ Y1994   <dbl> 20.97987, 25.22115, 23.32273, 26.36793, 21.14186, 24.6...
## $ Y1995   <dbl> 20.91132, 25.25874, 23.39526, 26.43569, 21.16022, 24.6...
## $ Y1996   <dbl> 20.85155, 25.31097, 23.46811, 26.50769, 21.19076, 24.7...
## $ Y1997   <dbl> 20.81307, 25.33988, 23.54160, 26.58255, 21.22621, 24.7...
## $ Y1998   <dbl> 20.78591, 25.39116, 23.61592, 26.66337, 21.27082, 24.8...
## $ Y1999   <dbl> 20.75469, 25.46555, 23.69486, 26.75078, 21.31954, 24.9...
## $ Y2000   <dbl> 20.69521, 25.55835, 23.77659, 26.83179, 21.37480, 24.9...
## $ Y2001   <dbl> 20.62643, 25.66701, 23.86256, 26.92373, 21.43664, 25.0...
## $ Y2002   <dbl> 20.59848, 25.77167, 23.95294, 27.02525, 21.51765, 25.1...
## $ Y2003   <dbl> 20.58706, 25.87274, 24.05243, 27.12481, 21.59924, 25.2...
## $ Y2004   <dbl> 20.57759, 25.98136, 24.15957, 27.23107, 21.69218, 25.2...
## $ Y2005   <dbl> 20.58084, 26.08939, 24.27001, 27.32827, 21.80564, 25.3...
## $ Y2006   <dbl> 20.58749, 26.20867, 24.38270, 27.43588, 21.93881, 25.5...
## $ Y2007   <dbl> 20.60246, 26.32753, 24.48846, 27.53363, 22.08962, 25.6...
## $ Y2008   <dbl> 20.62058, 26.44657, 24.59620, 27.63048, 22.25083, 25.7...
# View a summary of bmi
sum_bmi <- as.data.frame(do.call(cbind, lapply(bmi, summary)))

sum_bmi[,-1] %>% 
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", , font_size = 11) %>%
  row_spec(0, bold = T, color = "white", background = "#3f7689")
Y1980 Y1981 Y1982 Y1983 Y1984 Y1985 Y1986 Y1987 Y1988 Y1989 Y1990 Y1991 Y1992 Y1993 Y1994 Y1995 Y1996 Y1997 Y1998 Y1999 Y2000 Y2001 Y2002 Y2003 Y2004 Y2005 Y2006 Y2007 Y2008
Min. 19.01394 19.03902 19.06804 19.09675 19.13046 19.16397 19.1974 19.23481 19.2709 19.31105 19.3515 19.39625 19.45212 19.51493 19.58757 19.66996 19.71305 19.7424 19.76632 19.79587 19.82768 19.86357 19.83696 19.80717 19.78927 19.78781 19.8018 19.8291 19.86692
1st Qu. 21.27082 21.30784 21.362375 21.41988 21.44577 21.466555 21.492555 21.501815 21.51734 21.548785 21.57262 21.599555 21.6513 21.743 21.758025 21.82634 21.89233 21.93932 21.99765 22.04246 22.11507 22.21658 22.291485 22.368735 22.44919 22.53678 22.6267 22.72645 22.832135
Median 23.31424 23.39054 23.46016 23.56861 23.63584 23.73109 23.82157 23.87321 23.92801 24.0271 24.14473 24.1979 24.19453 24.27485 24.35741 24.40843 24.41998 24.49745 24.4885 24.60608 24.6554 24.73358 24.80671 24.88965 25.00262 25.11463 25.23747 25.36476 25.49887
Mean 23.1547664321608 23.2102073869347 23.2648034673367 23.3172157286432 23.3705114572864 23.423494321608 23.4772568844221 23.5319515577889 23.5889472361809 23.6470072864322 23.7052090452261 23.7631105527638 23.8210563819095 23.8792342713568 23.9383301005025 24.001183718593 24.0682096984925 24.1390354773869 24.2115721105528 24.2864309045226 24.3642196984925 24.4430578894472 24.5241371859296 24.608454120603 24.6985252763819 24.7921330150754 24.8903988944724 24.9926272361809 25.0969846733668
3rd Qu. 24.82028 24.88683 24.94037 25.01943 25.056045 25.10566 25.20339 25.27423 25.335225 25.370375 25.38608 25.41697 25.47902 25.539305 25.6153 25.697265 25.778105 25.85415 25.936495 26.013855 26.086555 26.185345 26.295505 26.38302 26.473265 26.527285 26.590595 26.66216 26.82232
Max. 28.12449 28.35509 28.58248 28.81528 29.04548 29.28105 29.52002 29.75319 29.97877 30.20436 30.42198 30.63617 30.84639 31.04119 31.23113 31.41477 31.59122 31.76624 31.94707 32.13097 32.31834 32.50635 32.70215 32.89697 33.09517 33.2964 33.49282 33.69373 33.89634

Visualizing your data

There are many ways to visualize data. Since this is not a course about data visualization, we will only touch on two types of plots that may be useful for quickly identifying extreme or suspicious values in your data: histograms and scatter plots.

A histogram, created with the hist() function, takes a vector (i.e. column) of data, breaks it up into intervals, then plots as a vertical bar the number of instances within each interval. A scatter plot, created with the plot() function, takes two vectors (i.e. columns) of data and plots them as a series of (x, y) coordinates on a two-dimensional plane.

# Histogram of BMIs from 2008
hist(bmi$Y2008)

# Scatter plot comparing BMIs from 1980 to those from 2008
plot(bmi$Y1980, bmi$Y2008)